This week, we’ll calculate our own moving averages for time series and continue with network analysis.
(4 points each; 32 points total)
In Problem 3 of Lab 11, we created a time series plot of the NYC bike dataset, visualizing the number of trips per day in December 2015 as a function of time. Here, we’re going to add a moving average line to the plot.
ww, the moving average at time point tt in the time series is: mean(my_time_series[(tt-ww+1):tt]). Write a function, called moving_average that calculates the moving average of a time series vector at a single time point. The function should take three inputs: the time series, the window width, and the time point at which you want the moving average. The code is started for you below.# time_series: A vector containing the time series values
# ww: The window width for the moving average
# tt: The point at which we want the moving average (leading up to that point)
moving_average <- function(tt, time_series, ww) {
# Throw an error if the window width is too big
if (ww > length(time_series))
stop("Window width is greater than length of time series")
# If the window width is greater than the time point, return NA
if (tt < ww) return(NA)
# Your code here!
return (mean(time_series[(tt-ww+1):tt]))
}
get_moving_averages, that calculates all moving averages for a given time series vector. Your function should take two inputs: the time series and the window width. This function should call your function from part (a). The code is started for you below.# time_series: A vector containing the time series values
# ww: The window width for the moving average
get_moving_averages <- function(time_series, ww) {
# Throw an error if the window width is too big
if (ww > length(time_series))
stop("Window width is greater than length of time series")
# Your code here!
return (sapply(1:length(time_series), moving_average, time_series, ww))
}
n_trips variable in trips_per_day (use a window width of 3), and store this in a vector called bike_moving_average_3. Print the vector out. What do you notice about the first two observations?# Load in the data
library(gsheet)
bike_data <- as.data.frame(gsheet2tbl("https://docs.google.com/spreadsheets/d/143-u41MokWNU6w4_hptUypQ_sIc4ZIfEDsABq4MzDhA/edit?usp=sharing"))
# Add start_date variable to bike_data
bike_data$start_date <- as.Date(sapply(strsplit(bike_data$starttime, " "),
"[[", 1),
format = "%m/%d/%Y")
library(dplyr)
library(ggplot2)
# Summarize the bike_data, creating a new data.frame that includes the number
# of trips taken on each day
by_date <- group_by(bike_data, start_date)
trips_per_day <- summarize(by_date, n_trips = n())
# Get moving averages for n_trips
bike_moving_average_3 <- get_moving_averages(trips_per_day$n_trips, 3)
bike_moving_average_3
## [1] NA NA 30.33333 39.00000 38.33333 32.00000 28.33333
## [8] 33.33333 40.00000 69.00000 87.00000 86.33333 62.66667 46.00000
## [15] 36.33333 34.00000 31.00000 44.66667 55.66667 55.66667 42.00000
## [22] 36.33333 63.66667 81.33333 87.00000 66.33333 50.66667 49.33333
## [29] 53.00000
The first 2 observations are NA because the window width is 3, so there is not enough past data to calculate a moving average at these times.
bike_moving_average_3 variable to the trips_per_day dataset. Recreate the plot from Problem 3 in Lab 11. Now, add a second line showing the moving average in blue, with no points on the line.# Add variable to dataset
trips_per_day$bike_moving_average_3 <- bike_moving_average_3
# Create a time series plot with the dates on the x-axis and the number of
# trips per day on the y-axis
# Add moving average line
ggplot(trips_per_day, aes(x = start_date, y = n_trips)) + geom_line(aes(colour = "Original Data")) +
scale_x_date() + geom_point() +
geom_line(aes(x = start_date, y = bike_moving_average_3, colour = "Moving Average")) +
scale_colour_manual("", breaks = c("Original Data", "Moving Average"),
values = c("blue", "black")) +
ggtitle("Number of Bike Trips in December")+
labs(x = "Start Date", y = "Number of Trips")
The moving average always has a lag from the original data. That is, the moving average follows the same pattern as the data, but does so a few days later. This is because it is looking at the past 3 days of data to calculate its current value, so it is more similar to the past data than the future.
We could redefine the moving average to take the 3 (or \(w\)) closest observations in either direction, rather than the 3 (or \(w\)) previous observations to eliminate this issue.
weighted_moving_average and get_weighted_moving_averages, that mimic the functions in (a) and (b), but use a weighted moving average instead. Your functions should take the same parameters as their counterparts in parts (a) and (b), plus an additional parameter for the weighting scheme that should be used. The code is started for you below.# time_series: A vector containing the time series values
# ww: The window width for the moving average
# tt: The point at which
# weights: the weights to be used in the moving average
# Note: length(weights) should always equal ww!
weighted_moving_average <- function(tt, time_series, ww, weights = NULL) {
# Throw an error if the window width is too big
if (ww > length(time_series))
stop("Window width is greater than length of time series")
# If weights are not specified, use standard weights
if (is.null(weights)) weights <- rep(1/ww, ww)
# Throw an error if the window width is too big
if (length(weights) != ww)
stop("Weights should have the same length as the window width")
# If the window width is greater than the time point, return NA
if (tt < ww) return(NA)
# Standardize the weights so they sum to 1
weights <- weights / sum(weights)
# Your code here!
return (sum(weights * time_series[(tt-ww+1):tt]))
}
# time_series: A vector containing the time series values
# ww: The window width for the moving average
# weights: the weights to be used in the moving average
# Note: length(weights) should always equal ww!
get_weighted_moving_averages <- function(time_series, ww, weights) {
# Throw an error if the window width is too big
if(ww > length(time_series)) stop("Window width is greater than length of time series")
# If weights are not specified, use standard weights
if (is.null(weights)) weights <- rep(1/ww, ww)
# Throw an error if the window width is too big
if (length(weights) != ww)
stop("Weights should have the same length as the window width")
# Standardize the weights so they sum to 1
weights <- weights / sum(weights)
# Your code here!
return(sapply(1:length(time_series), weighted_moving_average, time_series, ww, weights))
}
weights = c(1,1,3). Add the result to the trips_per_day data.frame. Recreate your plot from (d), and add a red, dashed line (specify linetype) showing the weighted moving average on the plot.# Calculate weighted moving average of trips per day
weight_MA <- get_weighted_moving_averages(trips_per_day$n_trips, 3, weights = c(1, 1, 3))
trips_per_day$weight_MA <- weight_MA
# Create a time series plot with the dates on the x-axis and the number of
# trips per day on the y-axis
# Add moving average lines
ggplot(trips_per_day, aes(x = start_date, y = n_trips)) +
geom_line(aes(colour = "Original Data")) +
scale_x_date() + geom_point() +
geom_line(aes(x = start_date, y = bike_moving_average_3, colour = "Moving Average")) +
geom_line(aes(x = start_date, y = weight_MA, colour = "Weighted Moving Average"), linetype = 2) +
scale_color_manual(values = c("Original Data" = "black", "Moving Average" = "blue",
"Weighted Moving Average" = "red")) +
ggtitle("Number of Bike Trips in December") +
labs(x = "Start Date", y = "Number of Trips")
The weighted moving average is closer to the original data, as it weighted the more recent data higher and so takes less time to react to changes in the data since it puts more weight on them. Both moving averages are also smoother than the original data and do not have as high peaks and valleys since it is averaging over 3 time periods.
igraph, ggnetwork, and networks(4 points each; 32 points total)
The igraph package makes network analysis easy! Install and load the igraph and igraphdata packages. Load the UKfaculty network into R.
#install.packages("igraph")
#install.packages("igraphdata")
library(igraph)
library(igraphdata)
library(gridExtra)
data(UKfaculty)
Try out some of the built-in functions in the igraph package in order to summarize the UK faculty network. How many nodes (“vertices”) are in the network? (Use vcount().) How many edges (“links”) are in the network? (Use ecount().)
UKverts <- vcount(UKfaculty)
UKedges <- ecount(UKfaculty)
# plot(UKfaculty, edge.arrow.size=.4)
There are 81 vertices in the network and 817 edges in the network.
The igraph package has some built-in functions for analyzing specific nodes/vertices in the network. Use the neighbors() function to find both the in-degree and out-degree of the 11th UK faculty member. How many friends does faculty #11 claim to have? How many other faculty members claim that #11 is their friend? (See Professor Rodu’s notes for how to use the neighbors() function.)
neighbors(UKfaculty, 11, mode="in")
## + 2/81 vertices:
## [1] 46 58
neighbors(UKfaculty, 11, mode="out")
## + 0/81 vertices:
The 11th UK faculty member claims to have 0 friends, while 2 people claim that #11 is their freind.
Write a function, called get_degree, that calculates the in-degree or out-degree (number of neighbors) that a given node in a network has. Your function should take three inputs: the network, the node index/number, and the degree type (in or out). The code is started for you below.
We can easily do this using the neighbors function (see second function) or we can write it on our own (first function). You should always check your functions to make sure they work after you write them. In our case we can count the number of elements by hand using neighbors and then see if it matches the results of our get_degree function.
# node: The node index in the network
# network: The network to use in the degree calculations
# type: The degree type; must be either "in" or "out"
get_degree <- function(node, network, type) {
in.verts <- c()
for(i in 1:vcount(network)){
in.verts <- c(in.verts, unlist(network[i])[node] != 0)
}
if(type == "out"){
deg <- sum(as.vector(UKfaculty[node]) != 0)
} else{
deg <- sum(in.verts)
}
return(deg)
}
neighbors(UKfaculty, 11, mode="out")
## + 0/81 vertices:
neighbors(UKfaculty, 11, mode="in")
## + 2/81 vertices:
## [1] 46 58
get_degree(11, UKfaculty, "out")
## [1] 0
get_degree(11, UKfaculty, "in")
## [1] 2
# node: The node index in the network
# network: The network to use in the degree calculations
# type: The degree type; must be either "in" or "out"
get_degree <- function(node, network, type) {
# Your code here
return(length(neighbors(network, node, mode = type)))
}
neighbors(UKfaculty, 81, mode="out")
## + 6/81 vertices:
## [1] 1 38 59 73 74 75
neighbors(UKfaculty, 81, mode="in")
## + 4/81 vertices:
## [1] 37 59 73 74
get_degree(81, UKfaculty, "out")
## [1] 6
get_degree(81, UKfaculty, "in")
## [1] 4
Write a new function, get_all_degrees, that calculates the in- or out-degree of an entire network. Your function should call the function you wrote in part (c).
The group of apply functions in R is very useful! You can think of this functions as one-line for loops. I recommend you learn and practice using these since they are often much faster than for loops due to vectorization. If/when you take 36-350, you will learn about these in greater detail. I sometimes write code into a for loop first, because it is often more intuitive, and then rewrite it using apply, sapply, lapply, etc. I can also check my answers this way to make sure it’s doing what I want.
# network: The network to use in the degree calculations
# type: The degree type; must be either "in" or "out"
get_all_degrees <- function(network, type) {
# Your code here
return(sapply(1:vcount(network), get_degree, network, type))
}
Apply your function to the UK faculty dataset for both degree types. Create a new, three-column data.frame that contains the results. One column should contain the in-degrees, one should contain the out-degrees, and a third column should contain the node index/number.
in.degs <- get_all_degrees(UKfaculty, "in")
out.degs <- get_all_degrees(UKfaculty, "out")
node.index <- 1:vcount(UKfaculty)
InOutDegs <- data.frame(NodeIndex = node.index, InDegs = in.degs, OutDegs=out.degs)
head(InOutDegs)
## NodeIndex InDegs OutDegs
## 1 1 9 6
## 2 2 19 17
## 3 3 4 4
## 4 4 8 10
## 5 5 10 28
## 6 6 8 9
Create a scatterplot of the in-degrees vs. the out-degrees, and change the point type to be their node index/number. Describe the graph in (e). Which nodes are “overconfident” in their popularity (i.e. they claim to have many friends, but not as many others claim that they are friends with this person)? Which nodes are “underconfident” (i.e. many others claim that they are friends with this person, but this person does not claim to have many friends)?
library(ggplot2)
ggplot(InOutDegs, aes(x = InDegs, y = OutDegs)) + geom_text(aes(label = node.index)) +
geom_abline(intercept = 0, slope = 1) +
ggtitle("In vs. out degrees in UK faculty friendship")
Adding a y=x line to this graph proves to be very useful. It is easy to now see that anyone above the y=x line is overconfident because their out-degrees are greater than their in-degrees, and anyone below the y=x line is underconfident because their in-degrees are greater than their out-degrees. Anyone on the line has the same number of in and out degrees (e.g. person A claims to have 20 friends and 20 people claim to be friends with person A). We would not claim this person to be over or under confident.
Faculty members 62, 37, 29, 5, 15, 43, and 52 are especially overconfident. Faculty members 69, 18 and 54 are especially underconfident.
Finally, let’s actually visualize the network itself! Install and load the ggnetwork package. This package is brand new – it was just released on March 28th. You can read more about how to use it here. Visualize the UK faculty network. The code below should get you started (be sure to add a title, remove the x- and y-axis labels and tick marks, and adjust the legend as necessary). Feel free to adjust the graph as you see fit (not required) – see the article at the link above for more ideas. Finally, note that you’ll also need to install and load the intergraph package:
#install.packages("intergraph")
#install.packages("ggnetwork")
library(intergraph)
library(ggnetwork)
uk_data <- fortify(UKfaculty)
node_degrees <- igraph::degree(UKfaculty, mode = "in")
uk_degrees <- node_degrees[match(uk_data$vertex.names, 1:length(node_degrees))]
uk_data$degree <- uk_degrees
ggplot(uk_data,
aes(x, y, xend = xend, yend = yend)) +
geom_edges(arrow = arrow(length = unit(0.3, "lines")),
aes(color = as.factor(Group)), alpha = 0.5) +
geom_nodetext(aes(label = vertex.names, size = degree * 2),
color = "blue", fontface = "bold") +
ggtitle("Network of UK faculty friendship") +
scale_x_discrete(breaks=NULL, name="", expand = .08) +
scale_y_discrete(breaks=NULL, name="", expand = .08) +
scale_size("In Degree x 2") +
scale_color_discrete("School Affiliation")
Describe the network. How many groups (“cliques”) do there appear to be? In the above graph, to what does the size of the nodes correspond? To what does the color of the edges correspond? Is the graph a directed or an undirected graph?
We see a network of UK faculty with three main groups (although there are four school affiliations). Some people have many in and/or out degrees and some people do not. There is overlap between all of the cliques. Each node is labeled by the faculty member number and the size of the node is proportional to the number of in-degrees that member has. We can generalize and say that the bigger the node the more “popular” someone is because more people claim to be friends with them. The color of the edge corresponds to the school affiliation. This is a directed graph because arrows can go both ways. Node 1 can claim to be friends with node 2 but node 2 does not have to claim to be friends with node 1 (similar to twitter or instagram where you can follow someone without them following you back).
(30 points)
Install and load the gunsales package. This is a very new package, and its described in detail here. For some related reading to familiarize yourself with other work that’s been done using this data, see here.
The package’s analysis() function will give you very detailed dataset on gun sales in the US (it may be a good idea to use cache = TRUE in the following code block, since this calculation can be computationally intensive):
#install.packages("gunsales")
library(gunsales)
gun_data <- analysis()
## [1] "Increase in monthly gun sales in Missouri = 8773.09"
We’ll be analyzing this dataset from a time series perspective. As such, it may help to create a new date variable to help with time series plotting:
gun_data$year_month <- as.Date(paste0(gun_data$year,
ifelse(nchar(gun_data$month) == 2,
"-", "-0"),
gun_data$month, "-01"),
format = "%Y-%m-%d")
Your task: Create three graphs that you find interesting using the gun_data dataset. Here are a few constraints to get you started:
Other than that, what graphs you create are up to you!
For each of your three graphs, write 2-4 sentences describing the main results that you want your viewer to understand from the graph. Why is the graph interesting? What do you want your viewer to understand from the graph? Etc.
acf(gun_data$guns_total, main="Autocorrelation of Gun Sales by Month",
lag.max = 120)
The plot shows the estimated autocorrelation function for the time series of monthly gun sales. The autocorrelation is periodic, peaking every 12 months, indicating periodicity in the time series itself. The autocorrelations also change from all positive to all negative as the lag increases, indicating some trend in gun sales over time.
get_moving_averages <- function(ts, ww) {
averages <- numeric(length(ts) - ww + 1)
for (ii in 1:length(averages)) {
averages[ii] <- mean(ts[ii:(ii+ww-1)])
}
return(averages)
}
yearly_averages <- data.frame(avg=get_moving_averages(gun_data$guns_total, 12),
date = gun_data$year_month[6:187])
ggplot(gun_data) + geom_line(aes(x = year_month, y = guns_total,
color = "Actual Sales")) +
geom_line(data = yearly_averages, aes(x = date, y = avg,
color = "Yearly Moving Avg.")) +
labs(x = "Date", y = "Monthly Gun Sales") + ggtitle('Monthly Gun Sales') +
scale_color_manual(values=c("orange","dodger blue"), name="") + theme_bw()
This plot shows the time series of monthly gun sales along with a yearly moving average. The plot confirms the yearly periodicity shown in the estimated autocorrelation function. The moving average smoothes out this periodicity and makes it clear that there is also an increasing trend in gun sales over time. There have been two spikes in gun sales in recent years: one at the beginning of 2013 and one at the end of 2015.
In a more formal study of this data, it may be a good idea to suggest potential reasons for these spikes (e.g. changes in legislation, changes in politcal leadership, important world events, etc), as we do in the “Sales by State” section below.
one_year_averages <- data.frame(nj=get_moving_averages(gun_data$new_jersey, 12),
ms=get_moving_averages(gun_data$mississip, 12),
la=get_moving_averages(gun_data$louisiana, 12),
ga=get_moving_averages(gun_data$georgia, 12),
date = gun_data$year_month[6:187])
reg <- ggplot(gun_data) + geom_line(aes(x = year_month, y = louisiana,
color = "LA")) +
geom_line(aes(x = year_month, y = georgia, color = "GA")) +
geom_line(aes(x = year_month, y = mississippi, color = "MS")) +
geom_line(aes(x = year_month, y = new_jersey, color = "NJ")) +
labs(x="Date", y= "Percentage of National Gun Sales") + theme_bw() +
ggtitle("Gun Sales by State") +
scale_color_manual(name="State",values=c("#a6cee3","#1f78b4",
"#b2df8a","#33a02c"))
avgs <- ggplot(one_year_averages) +
geom_line(aes(x = date, y = la, color = "LA")) +
geom_line(aes(x = date, y = ga, color = "GA")) +
geom_line(aes(x = date, y = ms, color = "MS")) +
geom_line(aes(x = date, y = nj, color = "NJ")) +
labs(x="Date", y= "Percentage of National Gun Sales") + theme_bw() +
ggtitle("One Year Moving Average") +
scale_color_manual(name="State",values=c("#a6cee3","#1f78b4",
"#b2df8a","#33a02c"))
grid.arrange(reg, avgs, ncol=2)
The plots show that the three southern states make up a greater proportion of national gun sales than New Jersey, a northern state. However, the gap has decreased over time with all three southern states dropping in percentage of national sales from 2000 through 2015. The three southern states also all show a spike in sales in the aftermath of Hurricane Katrina, while New Jersey showed a spike in 2013 in the aftermath of local gun control efforts.